arXiv: 1504.00325v2 [cs.CV] 3 Apr 2015 




Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam 
Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick 


Abstract —In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will 
contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent 
human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an 
evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including 
BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided. 



1 Introduction 

The automatic generation of captions for images is a
long standing and challenging problem in artificial
intelligence [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11],
[12], [13], [14], [15], [16], [17], [18], [19]. Research in
this area spans numerous domains, such as computer
vision, natural language processing, and machine learning.
Recently there has been a surprising resurgence of
interest in this area [20], [21], [22], [23], [24], [25], [26],
[27], [28], [29], [30], due to the renewed interest in neural
network learning techniques [31], [32] and increasingly
large datasets [33], [34], [35], [7], [36], [37], [38].

In this paper, we describe our process of collecting
captions for the Microsoft COCO Caption dataset, and
the evaluation server we have set up to evaluate the
performance of different algorithms. The MS COCO caption
dataset contains human generated captions for images
contained in the Microsoft Common Objects in COntext
(COCO) dataset [38]. Similar to previous datasets [7],
[36], we collect our captions using Amazon's Mechanical
Turk (AMT). Upon completion of the dataset it will
contain over a million captions.

When evaluating image caption generation algorithms,
it is essential that a consistent evaluation protocol
is used. Comparing results from different approaches can
be difficult since numerous evaluation metrics exist [39],
[40], [41], [42]. To further complicate matters the
implementations of these metrics often differ. To help alleviate
these issues, we have built an evaluation server to enable
consistency in evaluation of different caption generation
approaches. Using the testing data, our evaluation server
evaluates captions output by different approaches using
numerous automatic metrics: BLEU [39], METEOR [41],
ROUGE [40] and CIDEr [42]. We hope to augment these
results with human evaluations on an annual basis.

A horse carrying a large load of hay and two people sitting on it.

Bunk bed with a narrow shelf sitting underneath it.

Fig. 1: Example images and captions from the Microsoft
COCO Caption dataset.

This paper is organized as follows: First we describe
the data collection process. Next, we describe the caption
evaluation server and the various metrics used. Human
performance using these metrics is provided. Finally
the annotation format and instructions for using the
evaluation server are described for those who wish to submit
results. We conclude by discussing future directions and
known issues.


• Xinlei Chen is with Carnegie Mellon University. 

• Hao Fang is with the University of Washington. 

• T.Y. Lin is with Cornell NYC Tech. 

• Ramakrishna Vedantam is with Virginia Tech. 

• Saurabh Gupta is with the University of California, Berkeley.

• P. Dollar is with Facebook AI Research. 

• C. L. Zitnick is with Microsoft Research, Redmond. 


2 Data Collection 

In this section we describe how the data is gathered
for the MS COCO captions dataset. For images, we use
the dataset collected by Microsoft COCO [38]. These
images are split into training, validation and testing sets.
The images were gathered by searching for pairs of 80
object categories and various scene types on Flickr. The
goal of the MS COCO image collection process was to
gather images containing multiple objects in their natural
context. Given the visual complexity of most images
in the dataset, they pose an interesting and difficult
challenge for image captioning.

For generating a dataset of image captions, the same
training, validation and testing sets were used as in the
original MS COCO dataset. Two datasets were collected.
The first dataset MS COCO c5 contains five reference
captions for every image in the MS COCO training,
validation and testing datasets. The second dataset MS
COCO c40 contains 40 reference sentences for a randomly
chosen 5,000 images from the MS COCO testing
dataset. MS COCO c40 was created since many automatic
evaluation metrics achieve higher correlation with
human judgement when given more reference sentences
[42]. MS COCO c40 may be expanded to include the MS
COCO validation dataset in the future.

Our process for gathering captions received significant
inspiration from the work of Young et al. [36] and
Hodosh et al. [7] that collected captions on Flickr images
using Amazon's Mechanical Turk (AMT). Each of our
captions is also generated using human subjects on
AMT. Each subject was shown the user interface in
Figure 2. The subjects were instructed to:

• Describe all the important parts of the scene.

• Do not start the sentences with "There is".

• Do not describe unimportant details.

• Do not describe things that might have happened
in the future or past.

• Do not describe what a person might say.

• Do not give people proper names.

• The sentences should contain at least 8 words.

The number of captions gathered is 413,915 captions for
82,783 images in training, 202,520 captions for 40,504
images in validation and 379,249 captions for 40,775
images in testing including 179,189 for MS COCO c5 and
200,060 for MS COCO c40. For each testing image, we
collected one additional caption to compute the scores
of human performance for comparing scores of machine
generated captions. The total number of collected
captions is 1,026,459. We plan to collect captions for the MS
COCO 2015 dataset when it is released, which should
approximately double the size of the caption dataset.
The AMT interface may be obtained from the MS COCO
website.

3 Caption evaluation 

In this section we describe the MS COCO caption
evaluation server. Instructions for using the evaluation server
are provided in Section 5. As input the evaluation server
receives candidate captions for both the validation and
testing datasets in the format specified in Section 5. The
validation and test images are provided to the submitter.
However, the human generated reference sentences
are only provided for the validation set. The reference
sentences for the testing set are kept private to reduce
the risk of overfitting.

Fig. 2: Example user interface for the caption gathering
task.

Numerous evaluation metrics are computed on both
MS COCO c5 and MS COCO c40. These include BLEU-1,
BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR and
CIDEr-D. The details of these metrics are described
next.


3.1 Tokenization and preprocessing 

Both the candidate captions and the reference captions
are pre-processed by the evaluation server. To tokenize
the captions, we use the Stanford PTBTokenizer in the Stanford
CoreNLP tools (version 3.4.1) [43], which mimics Penn
Treebank 3 tokenization. In addition, punctuations¹ are
removed from the tokenized captions.
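
The punctuation-removal step can be illustrated with a minimal Python sketch. It assumes the captions have already been tokenized (the server uses the Java-based Stanford PTBTokenizer, which is not reproduced here). The token set below is an illustrative subset and the function name is hypothetical; the authoritative list is the one given in the paper's footnote.

```python
# Illustrative subset of punctuation tokens (an assumption of this sketch);
# the bracket tokens follow the Penn Treebank escapes named in the footnote.
PUNCTUATION = {"-LRB-", "-RRB-", "-LCB-", "-RCB-", ".", ",", "?", "!", "...", ";", ":"}


def strip_punctuation(tokens):
    """Return the tokens of an already tokenized caption without punctuation."""
    return [t for t in tokens if t not in PUNCTUATION]


tokens = ["A", "horse", "-LRB-", "brown", "-RRB-", "carrying", "hay", "."]
cleaned = strip_punctuation(tokens)
print(cleaned)
```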

3.2 Evaluation metrics 

Our goal is to automatically evaluate for an image $I_i$
the quality of a candidate caption $c_i$ given a set of
reference captions $S_i = \{s_{i1}, \ldots, s_{im}\} \in S$. The caption
sentences are represented using sets of n-grams, where
an n-gram $\omega_k \in \Omega$ is a set of one or more ordered words.
In this paper we explore n-grams with one to four words.
No stemming is performed on the words. The number
of times an n-gram $\omega_k$ occurs in a sentence $s_{ij}$ is denoted
$h_k(s_{ij})$ or $h_k(c_i)$ for the candidate sentence $c_i \in C$.
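
As a minimal sketch of the counts $h_k$, the following Python helper (a hypothetical name, not part of the evaluation server) maps a tokenized sentence to the number of occurrences of each n-gram of a given length.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a tokenized sentence (the h_k values)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


caption = "a horse carrying a large load of hay".split()
unigrams = ngram_counts(caption, 1)
bigrams = ngram_counts(caption, 2)
print(unigrams[("a",)])         # 2
print(bigrams[("a", "large")])  # 1
```

A sentence shorter than n simply yields an empty counter.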


3.3 BLEU 


BLEU [39] is a popular machine translation metric that 
analyzes the co-occurrences of n-grams between the 
candidate and reference sentences. It computes a corpus- 
level clipped n-gram precision between sentences as 
follows: 


$$CP_n(C, S) = \frac{\sum_i \sum_k \min\left(h_k(c_i), \max_{j \in m} h_k(s_{ij})\right)}{\sum_i \sum_k h_k(c_i)} \tag{1}$$


1. The full list of punctuations: '', ', ``, `, -LRB-, -RRB-, -LCB-, -RCB-,
., ?, !, ,, :, -, --, ..., ;








where $k$ indexes the set of possible n-grams of length $n$.
The clipped precision metric limits the number of times
an n-gram may be counted to the maximum number
of times it is observed in a single reference sentence.
Note that $CP_n$ is a precision score and it favors short
sentences. So a brevity penalty is also used:

$$b(C, S) = \begin{cases} 1 & \text{if } l_C > l_S \\ e^{1 - l_S / l_C} & \text{if } l_C \le l_S \end{cases} \tag{2}$$

where $l_C$ is the total length of the candidate sentences $c_i$
and $l_S$ is the corpus-level effective reference
length. When there are multiple references for a
candidate sentence, we choose to use the closest reference
length for the brevity penalty.

The overall BLEU score is computed using a weighted
geometric mean of the individual n-gram precisions:

$$BLEU_N(C, S) = b(C, S) \exp\left(\sum_{n=1}^{N} w_n \log CP_n(C, S)\right) \tag{3}$$

where $N = 1, 2, 3, 4$ and $w_n$ is typically held constant for
all $n$.
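
The corpus-level BLEU computation described above (clipped precision, brevity penalty with the closest reference length, and a geometric mean) can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's implementation: uniform weights $w_n = 1/N$ are assumed, ties in the closest reference length go to the shorter reference, no smoothing is applied, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """candidates: list of token lists; references: list of lists of token lists."""
    matched = [0] * max_n  # numerators of CP_n, summed over the corpus
    total = [0] * max_n    # denominators of CP_n, summed over the corpus
    len_c = 0              # l_C: total candidate length
    len_s = 0              # l_S: effective reference length
    for cand, refs in zip(candidates, references):
        len_c += len(cand)
        # Closest reference length; ties go to the shorter one (assumption).
        len_s += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            max_ref = Counter()
            for r in refs:
                for gram, count in ngram_counts(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            # Clip each candidate count by the maximum single-reference count.
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if min(matched) == 0:
        return 0.0  # log(0) is undefined; this sketch applies no smoothing
    brevity = 1.0 if len_c > len_s else math.exp(1 - len_s / len_c)
    log_mean = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    return brevity * math.exp(log_mean)


cand = "a b c d".split()
ref = "a b c d e f".split()
short_score = corpus_bleu([cand], [[ref]])
print(short_score)
```

In this example every candidate n-gram is matched, so the score is determined by the brevity penalty alone, $e^{1 - 6/4}$.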

BLEU has shown good performance for corpus-level
comparisons over which a high number of n-gram
matches exist. However, at a sentence-level the
n-gram matches for higher n rarely occur. As a result,
BLEU performs poorly when comparing individual
sentences.


3.4 ROUGE 


ROUGE [40] is a set of evaluation metrics designed to
evaluate text summarization algorithms.

1) ROUGE_N: The first ROUGE metric computes a
simple n-gram recall over all reference summaries
given a candidate sentence:

$$ROUGE_N(c_i, S_i) = \frac{\sum_j \sum_k \min\left(h_k(c_i), h_k(s_{ij})\right)}{\sum_j \sum_k h_k(s_{ij})} \tag{4}$$

2) ROUGE_L: ROUGE_L uses a measure based on the
Longest Common Subsequence (LCS). An LCS is
a set of words shared by two sentences which occur
in the same order. However, unlike n-grams there
may be words in between the words that create
the LCS. Given the length $l(c_i, s_{ij})$ of the LCS
between a pair of sentences, ROUGE_L is found by
computing an F-measure:

$$R_l = \max_j \frac{l(c_i, s_{ij})}{|s_{ij}|} \tag{5}$$

$$P_l = \max_j \frac{l(c_i, s_{ij})}{|c_i|} \tag{6}$$

$$ROUGE_L(c_i, S_i) = \frac{(1 + \beta^2) R_l P_l}{R_l + \beta^2 P_l} \tag{7}$$

$R_l$ and $P_l$ are recall and precision of LCS. $\beta$ is
usually set to favor recall ($\beta = 1.2$). Since n-grams
are implicit in this measure due to the use
of the LCS, they need not be specified.

3) ROUGE_S: The final ROUGE metric uses skip
bi-grams instead of the LCS or n-grams. Skip bi-grams
are pairs of ordered words in a sentence. However,
similar to the LCS, words may be skipped between
pairs of words. Thus, a sentence with 4 words
would have $C_4^2 = 6$ skip bi-grams. Precision and
recall are again incorporated to compute an
F-measure score. If $f_k(s_{ij})$ is the skip bi-gram count
for sentence $s_{ij}$, ROUGE_S is computed as:

$$R_s = \max_j \frac{\sum_k \min\left(f_k(c_i), f_k(s_{ij})\right)}{\sum_k f_k(s_{ij})} \tag{8}$$

$$P_s = \max_j \frac{\sum_k \min\left(f_k(c_i), f_k(s_{ij})\right)}{\sum_k f_k(c_i)} \tag{9}$$

$$ROUGE_S(c_i, S_i) = \frac{(1 + \beta^2) R_s P_s}{R_s + \beta^2 P_s} \tag{10}$$

Skip bi-grams are capable of capturing long range
sentence structure. In practice, skip bi-grams are
computed so that the component words occur at a
distance of at most 4 from each other.
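
Since ROUGE_L is the ROUGE variant reported by the evaluation server, a minimal Python sketch of it may help. It assumes tokenized, non-empty sentences and uses $\beta = 1.2$ as stated above; the function names are hypothetical and the code is an illustration of the LCS-based F-measure rather than the server's own implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, references, beta=1.2):
    """F-measure over the best LCS recall and precision across references."""
    recall = max(lcs_length(candidate, r) / len(r) for r in references)
    precision = max(lcs_length(candidate, r) / len(candidate) for r in references)
    if recall == 0 or precision == 0:
        return 0.0
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


candidate = "a b c d".split()
reference = "a x b c".split()
score = rouge_l(candidate, [reference])
print(score)
```

Here the LCS is "a b c", so recall and precision are both 3/4 and the F-measure equals 0.75 regardless of $\beta$.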


3.5 METEOR 


METEOR [41] is calculated by generating an alignment
between the words in the candidate and reference
sentences, with an aim of 1:1 correspondence. This
alignment is computed while minimizing the number of
chunks, $ch$, of contiguous and identically ordered tokens
in the sentence pair. The alignment is based on exact
token matching, followed by WordNet synonyms [44],
stemmed tokens and then paraphrases. Given a set of
alignments, $m$, the METEOR score is the harmonic mean
of precision $P_m$ and recall $R_m$ between the best scoring
reference and candidate:

$$Pen = \gamma \left(\frac{ch}{m}\right)^{\theta} \tag{11}$$

$$F_{mean} = \frac{P_m R_m}{\alpha P_m + (1 - \alpha) R_m} \tag{12}$$

$$P_m = \frac{|m|}{\sum_k h_k(c_i)} \tag{13}$$

$$R_m = \frac{|m|}{\sum_k h_k(s_{ij})} \tag{14}$$

$$METEOR = (1 - Pen) F_{mean} \tag{15}$$

Thus, the final METEOR score includes a penalty $Pen$
based on chunkiness of resolved matches and a
harmonic mean term that gives the quality of the resolved
matches. The default parameters $\alpha$, $\gamma$ and $\theta$ are used for
this evaluation. Note that similar to BLEU, statistics of
precision and recall are first aggregated over the entire
corpus, which are then combined to give the corpus-level
METEOR score.
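
The final scoring step of METEOR can be sketched in Python once an alignment is available. The sketch assumes the fragmentation penalty has the usual METEOR form $Pen = \gamma (ch/m)^{\theta}$ and takes the number of matches, the number of chunks and the sentence lengths as given; computing the alignment itself (exact, synonym, stem and paraphrase matching) is not shown. The function name and the parameter values in the usage example are placeholders, not the defaults used by the server.

```python
def meteor_from_alignment(matches, chunks, cand_len, ref_len, alpha, gamma, theta):
    """Combine alignment statistics into a METEOR-style score."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    # Weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: fewer, longer chunks give a smaller penalty.
    penalty = gamma * (chunks / matches) ** theta
    return (1 - penalty) * f_mean


# Placeholder parameter values chosen only for illustration.
score = meteor_from_alignment(matches=4, chunks=1, cand_len=4, ref_len=4,
                              alpha=0.9, gamma=0.5, theta=3.0)
print(score)
```

With four matches forming a single chunk, precision and recall are both 1 and the score is reduced only by the small penalty $0.5 \cdot (1/4)^3$.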














3.6 CIDEr

The CIDEr metric [42] measures consensus in image
captions by performing a Term Frequency Inverse Document
Frequency (TF-IDF) weighting for each n-gram.
The number of times an n-gram $\omega_k$ occurs in a reference
sentence $s_{ij}$ is denoted by $h_k(s_{ij})$ or $h_k(c_i)$ for the
candidate sentence $c_i$. CIDEr computes the TF-IDF weighting
$g_k(s_{ij})$ for each n-gram $\omega_k$ using:

$$g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log\left(\frac{|I|}{\sum_{I_p \in I} \min\left(1, \sum_q h_k(s_{pq})\right)}\right) \tag{16}$$

where $\Omega$ is the vocabulary of all n-grams and $I$ is the
set of all images in the dataset. The first term measures
the TF of each n-gram $\omega_k$, and the second term measures
the rarity of $\omega_k$ using its IDF. Intuitively, TF places higher
weight on n-grams that frequently occur in the reference
sentences describing an image, while IDF reduces the
weight of n-grams that commonly occur across all
descriptions. That is, the IDF provides a measure of word
saliency by discounting popular words that are likely to
be less visually informative. The IDF is computed using
the logarithm of the number of images in the dataset $|I|$
divided by the number of images for which $\omega_k$ occurs
in any of its reference sentences.

The CIDEr_n score for n-grams of length $n$ is computed
using the average cosine similarity between the
candidate sentence and the reference sentences, which
accounts for both precision and recall:

$$CIDEr_n(c_i, S_i) = \frac{1}{m} \sum_j \frac{g^n(c_i) \cdot g^n(s_{ij})}{\|g^n(c_i)\| \, \|g^n(s_{ij})\|} \tag{17}$$

where $g^n(c_i)$ is a vector formed by $g_k(c_i)$ corresponding
to all n-grams of length $n$ and $\|g^n(c_i)\|$ is the magnitude
of the vector $g^n(c_i)$. Similarly for $g^n(s_{ij})$.

Higher order (longer) n-grams are used to capture
grammatical properties as well as richer semantics.
Scores from n-grams of varying lengths are combined as
follows:

$$CIDEr(c_i, S_i) = \sum_{n=1}^{N} w_n \, CIDEr_n(c_i, S_i) \tag{18}$$

Uniform weights $w_n = 1/N$ are used. $N = 4$ is used.

CIDEr-D is a modification to CIDEr to make it more
robust to gaming. Gaming refers to the phenomenon
where a sentence that is poorly judged by humans tends
to score highly with an automated metric. To defend the
CIDEr metric against gaming effects, [42] add clipping
and a length based Gaussian penalty to the CIDEr metric
described above. This results in the following equations
for CIDEr-D:

$$CIDEr\text{-}D_n(c_i, S_i) = \frac{10}{m} \sum_j e^{\frac{-(l(c_i) - l(s_{ij}))^2}{2\sigma^2}} \cdot \frac{\min\left(g^n(c_i), g^n(s_{ij})\right) \cdot g^n(s_{ij})}{\|g^n(c_i)\| \, \|g^n(s_{ij})\|} \tag{19}$$

where $l(c_i)$ and $l(s_{ij})$ denote the lengths of candidate
and reference sentences respectively. $\sigma = 6$ is used. A
factor of 10 is used in the numerator to make the
CIDEr-D scores numerically similar to the other metrics.

The final CIDEr-D metric is computed in a similar
manner to CIDEr (analogous to eqn. 18):

$$CIDEr\text{-}D(c_i, S_i) = \sum_{n=1}^{N} w_n \, CIDEr\text{-}D_n(c_i, S_i) \tag{20}$$

Note that just like the BLEU and ROUGE metrics,
CIDEr-D does not use stemming. We adopt the CIDEr-D metric
for the evaluation server.

TABLE 1: Human Agreement for Image Captioning:
Various metrics when benchmarking a human generated
caption against ground truth captions.

Metric Name    MS COCO c5    MS COCO c40
BLEU 1         0.663         0.880
BLEU 2         0.469         0.744
BLEU 3         0.321         0.603
BLEU 4         0.217         0.471
METEOR         0.252         0.335
ROUGE_L        0.484         0.626
CIDEr-D        0.854         0.910
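
The CIDEr-D definition in Section 3.6 (TF-IDF weighting, clipped cosine similarity, Gaussian length penalty with $\sigma = 6$, factor of 10 and uniform weights over $n = 1, \ldots, 4$) can be sketched in Python as follows. This is an illustration that follows the equations as written, not the code run by the evaluation server. The function names are hypothetical, and treating an n-gram that appears in no reference as having document frequency 1 is an assumption of the sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(all_refs, n):
    """For each n-gram, the number of images whose references contain it."""
    df = Counter()
    for refs in all_refs:
        df.update({g for r in refs for g in ngram_counts(r, n)})
    return df


def tfidf(tokens, n, df, num_images):
    """TF-IDF vector g^n of a sentence, stored as a dict keyed by n-gram."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    # Unseen n-grams are given df = 1 (assumption of this sketch).
    return {g: (c / total) * math.log(num_images / max(1, df[g]))
            for g, c in counts.items()}


def norm(vec):
    return math.sqrt(sum(v * v for v in vec.values()))


def cider_d(candidate, refs, all_refs, max_n=4, sigma=6.0):
    """candidate: token list; refs: its references; all_refs: references of every image."""
    score = 0.0
    for n in range(1, max_n + 1):
        df = document_frequency(all_refs, n)
        g_c = tfidf(candidate, n, df, len(all_refs))
        total_n = 0.0
        for r in refs:
            g_r = tfidf(r, n, df, len(all_refs))
            denom = norm(g_c) * norm(g_r)
            if denom == 0:
                continue
            # Clipping: candidate weights are capped by the reference weights.
            dot = sum(min(g_c.get(g, 0.0), w) * w for g, w in g_r.items())
            penalty = math.exp(-((len(candidate) - len(r)) ** 2) / (2 * sigma ** 2))
            total_n += penalty * dot / denom
        score += (10.0 / len(refs)) * total_n / max_n  # uniform w_n = 1 / N
    return score


all_refs = [
    [["a", "dog", "runs", "fast"]],
    [["a", "cat", "sleeps", "well"]],
    [["the", "bird", "flies", "high"]],
]
same = cider_d(all_refs[0][0], all_refs[0], all_refs)
other = cider_d(all_refs[1][0], all_refs[0], all_refs)
print(same, other)
```

In this toy corpus a candidate identical to its single reference reaches the maximum of 10, while a caption of another image that shares only the unigram "a" scores far lower.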


4 Human performance 

In this section, we study the agreement among
humans at this task. We start with analyzing the
inter-human agreement for image captioning (Section 4.1) and
then analyze human agreement for the word prediction
sub-task and provide a simple model which explains
human agreement for this sub-task (Section 4.2).

4.1 Human Agreement for Image Captioning 

When examining human agreement on captions, it be¬ 
comes clear that there are many equivalent ways to 
say essentially the same thing. We quantify this by 
conducting the following experiment: We collect one 
additional human caption for each image in the test 
set and treat this caption as the prediction. Using the 
MS COCO caption evaluation server we compute the 
various metrics. The results are tabulated in Table 1. 

4.2 Human Agreement for Word Prediction 

We can do a similar analysis for human agreement at the
sub-task of word prediction. Consider the task of tagging
the image with words that occur in the captions. For this
task, we can compute the human precision and recall for
a given word $w$ by benchmarking words used in the $(k+1)$-th
human caption with respect to words used in the first $k$
reference captions. Note that we use weighted versions
of precision and recall, where each negative image has
a weight of 1 and each positive image has a weight
equal to the number of captions containing the word
$w$. Human precision ($H_p$) and human recall ($H_r$) can be
computed from the counts of how many subjects out of
$k$ use the word $w$ to describe a given image over the
whole dataset.

We plot $H_p$ versus $H_r$ for a set of nouns, verbs and
adjectives, and all 1000 words considered in Figure 3.
Nouns referring to animals like 'elephant' have a high
recall, which means that if an 'elephant' exists in the
image, a subject is likely to talk about it (which makes
intuitive sense, given 'elephant' images are somewhat
rare, and there are no alternative words that could
be used instead of 'elephant'). On the other hand, an
adjective like 'bright' is used inconsistently and hence
has low recall. Interestingly, words with high recall also
have high precision. Indeed, all the points of human
agreement appear to lie on a one-dimensional curve in
the two-dimensional precision-recall space.

This observation motivates us to propose a simple
model for when subjects use a particular word $w$ for
describing an image. Let $o$ denote an object or visual
concept associated with word $w$, $n$ be the total number of
images, and $k$ be the number of reference captions. Next,
let $q = P(o = 1)$ be the probability that object $o$ exists in
an image. For clarity these definitions are summarized
in Table 2. We make two simplifications. First, we
ignore image level saliency and instead focus on word level
saliency. Specifically, we only model $p = P(w = 1 \mid o = 1)$,
the probability a subject uses $w$ given that $o$ is in the
image, without conditioning on the image itself. Second,
we assume that $P(w = 1 \mid o = 0) = 0$, i.e. that a subject
does not use $w$ unless $o$ is in the image. As we will
show, even with these simplifications our model suffices
to explain the empirical observations in Figure 3 to a
reasonable degree of accuracy.

TABLE 2: Model definitions.

o = object or visual concept
w = word associated with o
n = total number of images
k = number of captions per image
q = P(o = 1)
p = P(w = 1 | o = 1)

Given these assumptions, we can model human
precision $H_p$ and recall $H_r$ for a word $w$ given only $p$ and $k$.
First, given $k$ captions per image, we need to compute
the expected number of (1) captions containing $w$ ($cw$),
(2) true positives ($tp$), and (3) false positives ($fp$). Note
that in our definition there can be up to $k$ true positives
per image (if $cw = k$, i.e. each of the $k$ captions contains
word $w$) but at most 1 false positive (if none of the $k$
captions contains $w$). The expectations, in terms of $k$, $p$,
and $q$ are:

$$\begin{aligned}
E[cw] &= \sum_{i=1}^{k} P(w^i = 1) \\
      &= \sum_i P(w^i = 1 \mid o = 1) P(o = 1) + \sum_i P(w^i = 1 \mid o = 0) P(o = 0) \\
      &= kpq + 0 = kpq
\end{aligned}$$

$$\begin{aligned}
E[tp] &= \sum_{i=1}^{k} P(w^i = 1 \wedge w^{k+1} = 1) \\
      &= \sum_i P(w^i = 1 \wedge w^{k+1} = 1 \mid o = 1) P(o = 1) + \sum_i P(w^i = 1 \wedge w^{k+1} = 1 \mid o = 0) P(o = 0) \\
      &= kppq + 0 = kp^2q
\end{aligned}$$

$$\begin{aligned}
E[fp] &= P(w^1 \ldots w^k = 0 \wedge w^{k+1} = 1) \\
      &= P(o = 1 \wedge w^1 \ldots w^k = 0 \wedge w^{k+1} = 1) + P(o = 0 \wedge w^1 \ldots w^k = 0 \wedge w^{k+1} = 1) \\
      &= q(1 - p)^k p + 0 = q(1 - p)^k p
\end{aligned}$$

In the above $w^i = 1$ denotes that $w$ appeared in the $i$-th
caption. Note that we are also assuming independence
between subjects conditioned on $o$. We can now define
model precision and recall as:

$$H_p = \frac{nE[tp]}{nE[tp] + nE[fp]} = \frac{pk}{pk + (1 - p)^k}$$

$$H_r = \frac{nE[tp]}{nE[cw]} = p$$


Note that these expressions are independent of $q$ and
only depend on $p$. Interestingly, because of the use of
weighted precision and recall, the recall for a category
comes out to be exactly equal to $p$, the probability a
subject uses $w$ given that $o$ is in the image.
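
The cancellation of $q$ can be checked numerically. The short Python sketch below evaluates the model expectations $E[cw] = kpq$, $E[tp] = kp^2q$ and $E[fp] = q(1-p)^k p$ and derives precision and recall from them; the function names are hypothetical and the values of $p$, $q$ and $k$ are arbitrary examples.

```python
def expectations(p, q, k):
    """Expected captions containing w, true positives and false positives per image."""
    e_cw = k * p * q
    e_tp = k * p * p * q
    e_fp = q * (1 - p) ** k * p
    return e_cw, e_tp, e_fp


def model_precision_recall(p, q, k):
    """Model precision H_p and recall H_r; the image count n cancels out."""
    e_cw, e_tp, e_fp = expectations(p, q, k)
    return e_tp / (e_tp + e_fp), e_tp / e_cw


for q in (0.1, 0.3):
    precision, recall = model_precision_recall(p=0.5, q=q, k=4)
    print(q, round(precision, 4), recall)
```

Both values of $q$ give the same precision, $pk / (pk + (1-p)^k) = 2 / 2.0625 \approx 0.9697$, and a recall equal to $p = 0.5$.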

We set $k = 4$ and vary $p$ to plot $H_p$ versus $H_r$,
getting the curve as shown in blue in Figure 3 (bottom
left). The curve explains the observed data quite well,
closely matching the precision-recall tradeoffs of the
empirical data (although not perfectly). We can also
reduce the number of captions from four, and look at
how the empirical and predicted precision and recall
change. Figure 3 (bottom right) shows this variation as
we reduce the number of reference captions per image
from four to one. We see that the points of
human agreement remain at the same recall value, but
decrease in their precision, which is consistent with what
the model predicts. Also, the human precision at infinite
subjects will approach one, which is again reasonable
given that a subject will only use the word $w$ if the
corresponding object is in the image (and in the presence
of infinite subjects someone else will also use the word
$w$).

In fact, the fixed recall value can help us recover
$p$, the probability that a subject will use the word $w$
in describing the image given the object is present.
Nouns like 'elephant' and 'tennis' have large $p$, which
is reasonable. Verbs and adjectives, on the other hand,
have smaller $p$ values, which can be justified from the
fact that a) subjects are less likely to describe attributes
of objects and b) subjects might use a different word
(synonym) to describe the same attribute.

Fig. 3: Precision-recall points for human agreement: we compute precision and recall by treating one human caption
as prediction and benchmark it against the others to obtain points on the precision recall curve. We plot these points
for example nouns (top left), adjectives (top center), and verbs (top right), and for all words (bottom left). We also
plot the fit of our model for human agreement with the empirical data (bottom left) and show how the human
agreement changes with different number of captions being used (bottom right). We see that the human agreement
point remains at the same recall value but dips in precision when using fewer captions.

This analysis of human agreement also motivates us¬ 
ing a different metric for measuring performance. We 
propose Precision at Human Recall (PHR) as a metric 
for measuring performance of a vision system perform¬ 
ing this task. Given that human recall for a particular 
word is fixed and precision varies with the number of 
annotations, we can look at system precision at human
recall and compare it with human precision to report the 
performance of the vision system. 

5 Evaluation Server Instructions 

Directions on how to use the MS COCO caption evalu¬ 
ation server can be found on the MS COCO website. 
The evaluation server is hosted by CodaLab. To par¬ 
ticipate, a user account on CodaLab must be created. 
The participants need to generate results on both the 
validation and testing datasets. When training for the 
generation of results on the test dataset, the training 
and validation dataset may be used as the participant 
sees fit. That is, the validation dataset may be used for 
training if desired. However, when generating results on 
the validation set, we ask participants to only train on 
the training dataset, and only use the validation dataset 


for tuning meta-parameters. Two JSON files should be 
created corresponding to results on each dataset in the 
following format: 

[{
"image_id" : int,
"caption" : str,
}]

The results may then be placed into a zip file and 
uploaded to the server for evaluation. Code is also 
provided on GitHub to evaluate results on the validation 
dataset without having to upload to the server. The 
number of submissions per user is limited to a fixed 
amount. 
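
As a minimal sketch of preparing a submission, the following Python code writes a list of results in the format above to a JSON file and places it in a zip archive. The image ids, captions and file names are placeholders chosen for illustration, and the helper name is hypothetical; the file names the server actually expects are given in the instructions on the MS COCO website.

```python
import json
import os
import tempfile
import zipfile

# Placeholder predictions: the ids and captions are made up for illustration.
results = [
    {"image_id": 1, "caption": "a horse carrying a large load of hay"},
    {"image_id": 2, "caption": "a bunk bed with a narrow shelf underneath it"},
]


def write_results(results, json_path, zip_path):
    """Validate the entries, dump them as JSON and add the file to a zip archive."""
    for entry in results:
        if not isinstance(entry["image_id"], int):
            raise ValueError("image_id must be an int")
        if not isinstance(entry["caption"], str):
            raise ValueError("caption must be a str")
    with open(json_path, "w") as f:
        json.dump(results, f)
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(json_path, arcname=os.path.basename(json_path))


with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "results.json")  # hypothetical file name
    zip_path = os.path.join(tmp, "results.zip")    # hypothetical file name
    write_results(results, json_path, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        loaded = json.loads(zf.read("results.json"))
print(names)
```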

6 Discussion 

Many challenges exist when creating an image caption 
dataset. As stated in [7], [42], [45] the captions generated 
by human subjects can vary significantly. However even 
though two captions may be very different, they may 
be judged equally "good" by human subjects. Designing 
effective automatic evaluation metrics that are highly 
correlated with human judgment remains a difficult 
challenge [7], [42], [45], [46]. We hope that by releasing
results on the validation data, we can help enable future
research in this area.

Since automatic evaluation metrics do not always
correspond to human judgment, we hope to conduct
experiments using human subjects to judge the quality of
automatically generated captions, which captions are most
similar to human captions, and whether they are grammatically
correct [45], [42], [7], [4], [5]. This is essential to determining
whether future algorithms are indeed improving, or
whether they are merely overfitting to a specific metric.
These human experiments will also allow us to evaluate
the automatic evaluation metrics themselves, and see
which ones are correlated with human judgment.

References 

[1] K. Barnard and D. Forsyth, "Learning the semantics of words and
pictures," in ICCV, vol. 2, 2001, pp. 408-415.

[2] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and 
M. I. Jordan, "Matching words and pictures," JMLR, vol. 3, pp. 
1107-1135, 2003. 

[3] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning 
the semantics of pictures," in NIPS, 2003. 

[4] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. 
Berg, "Baby talk: Understanding and generating simple image 
descriptions," in CVPR, 2011. 

[5] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, 
K. Yamaguchi, T. Berg, K. Stratos, and H. Daume III, "Midge: 
Generating image descriptions from computer vision detections," 
in EACL, 2012. 

[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, 
J. Hockenmaier, and D. Forsyth, "Every picture tells a story: 
Generating sentences from images," in ECCV, 2010. 

[7] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image
description as a ranking task: Data, models and evaluation metrics,"
JAIR, vol. 47, pp. 853-899, 2013.

[8] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi,
"Collective generation of natural image descriptions," in ACL,
2012.

[9] Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos, "Corpus- 
guided sentence generation of natural images," in EMNLP, 2011. 

[10] A. Gupta, Y. Verma, and C. Jawahar, "Choosing linguistics over 
vision to describe images." in AAAI, 2012. 

[11] E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran, "Distributional 
semantics in technicolor," in ACL, 2012. 

[12] Y. Feng and M. Lapata, "Automatic caption generation for news 
images," TPAMI, vol. 35, no. 4, pp. 797-812, 2013. 

[13] D. Elliott and F. Keller, "Image description using visual
dependency representations," in EMNLP, 2013, pp. 1292-1302.

[14] A. Karpathy, A. Joulin, and F.-F. Li, "Deep fragment embeddings 
for bidirectional image sentence mapping," in NIPS, 2014. 

[15] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik,
"Improving image-sentence embeddings using large weakly
annotated photo collections," in ECCV, 2014, pp. 529-545.

[16] R. Mason and E. Charniak, "Nonparametric method for data- 
driven image captioning," in ACL, 2014. 

[17] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, "Treetalk:
Composition and compression of trees for image descriptions," TACL,
vol. 2, pp. 351-362, 2014.

[18] K. Ramnath, S. Baker, L. Vanderwende, M. El-Saban, S. N. 
Sinha, A. Kannan, N. Hassan, M. Galley, Y. Yang, D. Ramanan, 
A. Bergamo, and L. Torresani, "Autocaption: Automatic caption 
generation for personal photos," in WACV, 2014. 

[19] A. Lazaridou, E. Bruni, and M. Baroni, "Is this a wampimuk? 
cross-modal mapping between distributional semantics and the 
visual world," in ACL, 2014. 

[20] R. Kiros, R. Salakhutdinov, and R. Zemel, "Multimodal neural 
language models," in ICML, 2014. 

[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain
images with multimodal recurrent neural networks," arXiv preprint
arXiv:1410.1090, 2014.


[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: 
A neural image caption generator," arXiv preprint arXiv:1411.4555, 
2014. 

[23] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments 
for generating image descriptions," arXiv preprint arXiv:1412.2306, 
2014. 

[24] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual- 
semantic embeddings with multimodal neural language models," 
arXiv preprint arXiv:1411.2539, 2014. 

[25] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, 
S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent 
convolutional networks for visual recognition and description," 
arXiv preprint arXiv:1411.4389, 2014. 

[26] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar,
J. Gao, X. He, M. Mitchell, J. Platt et al., "From captions to visual
concepts and back," arXiv preprint arXiv:1411.4952, 2014.

[27] X. Chen and C. L. Zitnick, "Learning a recurrent visual
representation for image caption generation," arXiv preprint arXiv:1411.5654,
2014.

[28] R. Lebret, P. O. Pinheiro, and R. Collobert, "Phrase-based image 
captioning," arXiv preprint arXiv:1502.03671, 2015. 

[29] R. Lebret, P. O. Pinheiro, and R. Collobert, "Simple image description
generator via a linear phrase-based approach," arXiv preprint
arXiv:1412.8419, 2014.

[30] A. Lazaridou, N. T. Pham, and M. Baroni, "Combining language 
and vision with a multimodal skip-gram model," arXiv preprint 
arXiv:1501.02598, 2015. 

[31] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet
classification with deep convolutional neural networks," in NIPS, 2012.

[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," 
Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997. 

[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
"ImageNet: A Large-Scale Hierarchical Image Database," in CVPR,
2009.

[34] M. Grubinger, P. Clough, H. Muller, and T. Deselaers, "The iapr tc- 
12 benchmark: A new evaluation resource for visual information 
systems," in LREC Workshop on Language Resources for Content- 
based Image Retrieval, 2006. 

[35] V. Ordonez, G. Kulkarni, and T. Berg, "Im2text: Describing images 
using 1 million captioned photographs." in NIPS, 2011. 

[36] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image
descriptions to visual denotations: New similarity metrics for
semantic inference over event descriptions," TACL, vol. 2,
pp. 67-78, 2014.

[37] J. Chen, P. Kuznetsova, D. Warren, and Y. Choi, "Deja image- 
captions: A corpus of expressive image descriptions in repetition," 
in NAACL, 2015. 

[38] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, 
P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects 
in context," in ECCV, 2014. 

[39] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method 
for automatic evaluation of machine translation," in ACL, 2002. 

[40] C.-Y. Lin, "Rouge: A package for automatic evaluation of
summaries," in ACL Workshop, 2004.

[41] M. Denkowski and A. Lavie, "Meteor universal: Language
specific translation evaluation for any target language," in EACL
Workshop on Statistical Machine Translation, 2014.

[42] R. Vedantam, C. L. Zitnick, and D. Parikh, "Cider: 
Consensus-based image description evaluation," arXiv preprint 
arXiv:1411.5726, 2014. 

[43] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J.
Bethard, and D. McClosky, "The Stanford CoreNLP natural
language processing toolkit," in Proceedings of 52nd Annual
Meeting of the Association for Computational Linguistics: System
Demonstrations, 2014, pp. 55-60. [Online]. Available:
http://www.aclweb.org/anthology/P/P14/P14-5010

[44] G. A. Miller, "Wordnet: a lexical database for english,"
Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.

[45] D. Elliott and F. Keller, "Comparing automatic evaluation
measures for image description," in Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics, vol. 2, 2014,
pp. 452-457.

[46] C. Callison-Burch, M. Osborne, and P. Koehn, "Re-evaluating the
role of bleu in machine translation research," in EACL, vol. 6,
2006, pp. 249-256.


